We investigate the computational structure of a paradigmatic example of distributed social
interaction: that of the open-source Wikipedia community. The typical computational
approach to modeling such a system is to rely on finite-state machines. However, we find
strong evidence in this system for the emergence of processing powers over and above the
finite-state. Thus, Wikipedia, understood as an information processing system, must have
access to (at least one) effectively unbounded resource. The nature of this resource is such
that one observes far longer runs of cooperative behavior than one would expect using
finite-state models. We provide evidence that the emergence of this non-finite-state computation
is driven by collective interaction effects.

Social systems, particularly human social systems, process information. From the price-setting
functions of free-market economies [1, 2] to resource management in traditional communities [3],
from deliberations in large-scale democracies [4, 5] to the formation of opinions and spread
of reputational information in organizations [6] and social groups [7, 8], it has been recognized that
such groups can perform functions analogous to (and often better than) engineered systems. Such
functional roles are found in groups in addition to their contingent historical aspects and, when
described mathematically, may be compared across cultures and times.



The computational phenomena implicit in social systems are only now, with the advent of large, 
high-resolution data-sets, coming under systematic, empirical study at large scales. What does not
appear to be appreciated in the literature is the extent to which the statistics of their behaviors 
inform us about the expressive capability and processing abilities of the abstract computations the 
system is, in fact, performing. 

Such constraints are important for both the prediction and conceptual understanding of observed 
phenomena. In order to make good predictions of symbolic systems, most learning algorithms 
assume that the (predictive) description being fit falls within the same model class as the observed 
process. Thus, evidence for a non-finite-state process is crucial because of the central role that 
finite-state models (often known as Hidden Markov Models, or HMMs) currently play in the
description and inference of complex biological and social processes. 

From the point of view of theory, the computational hierarchy (actually a partial order, see e.g., 
Ref. [9]) provides a mathematically rigorous way to classify the functional properties of a system 
independent of its material substrate, and allows one to determine when one system has strictly 
greater powers than another. From this point of view, it allows one to determine the emergence 
of novel computational abilities. A classic example of the importance of this conceptual aspect of 
the problem can be found in the persistent influence of the Chomsky Hierarchy for studies of both 
human [10] and non-human [11] communication.

A particular consequence of the general discussion of grammatical inference to be found in 
Ref. [12] is that a finite amount of data by itself can never distinguish between two classes whose 
distinctions are defined in terms of bounded vs. unbounded resources [13]. Our argument for 
the emergence of non-finite computational properties thus relies on the statistical inference of
asymptotic properties of a finite-state system. 

In particular, we prove a result that we refer to as the probabilistic pumping lemma: for any
finite-state process, and any string $w$, of sufficient length, produced by the process, the probability
that a word of length $|w|n$ is found to be $w^n$ decays exponentially as $n$ becomes large.

The outline of our paper is as follows. We state, and prove, the lemma described above, in Sec. I 
and Appendix V. We establish the main empirical result of this work in Sec. II, where we examine 
the symbolic dynamics of the large-scale social phenomenon of article editing in Wikipedia. In 
considering the top ten most-edited articles in the encyclopaedia, we find strong evidence in a 
majority of cases for a violation of the probabilistic pumping lemma, and thus symbolic dynamics 
over and above that of the finite-state. 

We then discuss the possible origins of this effectively resource-unbounded system in Sec. III. We 
conclude with the implications of this finding for the computational complexity of social systems in 
the concluding Sec. IV, where we compare our findings with recent work and explore the analogy 
between formal grammars and social behavior. 



I. THE PROBABILISTIC PUMPING LEMMA

We first show explicitly that probabilistic finite-state processes have an exponential cutoff in the
asymptotic distribution of repeated words. We do this by showing that the limiting ratio of
$P(w^{k+1})$ to $P(w^k)$ (the probability of observing the word $w$ repeated $k$ times in a sample of
length $k|w|$), as $k$ becomes large, approaches a constant strictly less than unity [11]. We will be
able to determine that limiting constant in terms of the properties of the underlying system.

Statement of Lemma. For any probabilistic finite-state process, and any initial distribution over
internal states, there exists a positive real number $\epsilon$, strictly less than unity, such that

$$\exp\left[\limsup_{k\to\infty} \frac{1}{k} \log P(w^k)\right] = \epsilon \qquad (1)$$

as $k$ becomes large, with $0 < \epsilon < 1$. The limiting value, $\epsilon$, is the spectral radius of
$A_{ij}(w)$, the natural extension of $A_{ij}(\sigma)$, the (Mealy-formalism) symbol transition matrix,
to multi-letter words.

The complete proof is given in Appendix V. Tests of the numerical convergence of this relation
are presented in Appendix VI, where we study how small machines ($n$, number of states, of order
ten) converge to the bound of Eq. 1 for a uniform prior over spectral radius.

Informally, the lemma says that $P(w^k)$ is bounded above by an exponential cutoff of the form
$\epsilon^k$, $0 < \epsilon < 1$. For most processes, the relevant scale for the limit to obtain is $k$ of order $p$, the
number of states in the underlying process; we present numerical evidence for this in Appendix VI.
Under the mild assumption that the system has passed through its transient states to an aperiodic
final class, the probability $P(w^k)$ takes the form of a sum of exponentials,

$$P(w^k) = \sum_{i=1}^{n} a_i \beta_i^k, \qquad (2)$$

where the $a_i$ are constant amplitudes, $n$ is the number of states, and the $\beta_i$, all strictly less
than unity, are the eigenvalues of the $A_{ij}(w)$ transition matrix. Eq. 2, which we refer to as the nEXP
model, forms the basis of our model comparisons, and our claims of evidence of non-finite-state
computation, in the next section.

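As a numerical illustration of the lemma, the following minimal sketch builds a small random Mealy-style machine, forms the word matrix $A(w)$ as a product of symbol matrices, and tracks the ratio $P(w^{k+1})/P(w^k)$. The machine, its size, the word "CCR", and all function names (`random_mealy`, `word_matrix`, `repeat_probs`) are illustrative assumptions and are not taken from the paper or its appendices.

```python
import random

def random_mealy(n_states, symbols, seed=0):
    """Random Mealy-style HMM: A[s][i][j] = P(emit s and move to j | in state i)."""
    rng = random.Random(seed)
    A = {s: [[0.0] * n_states for _ in range(n_states)] for s in symbols}
    for i in range(n_states):
        weights = {s: [rng.random() for _ in range(n_states)] for s in symbols}
        total = sum(sum(row) for row in weights.values())
        for s in symbols:
            for j in range(n_states):
                # Normalize so the outgoing probability mass of state i is one.
                A[s][i][j] = weights[s][j] / total
    return A

def matmul(X, Y):
    """Plain-Python product of two square matrices."""
    n = len(X)
    return [[sum(X[i][m] * Y[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]

def word_matrix(A, word):
    """A(w): product of the symbol matrices along the letters of w."""
    M = A[word[0]]
    for s in word[1:]:
        M = matmul(M, A[s])
    return M

def repeat_probs(pi, Aw, k_max):
    """P(w^k) = sum_ij pi_i [A(w)^k]_ij for k = 1, ..., k_max."""
    probs, v = [], list(pi)
    n = len(v)
    for _ in range(k_max):
        v = [sum(v[i] * Aw[i][j] for i in range(n)) for j in range(n)]
        probs.append(sum(v))
    return probs

A = random_mealy(n_states=4, symbols="CR", seed=1)
Aw = word_matrix(A, "CCR")
pi = [0.25] * 4                      # uniform initial distribution (an assumption)
P = repeat_probs(pi, Aw, k_max=40)
ratios = [P[k + 1] / P[k] for k in range(len(P) - 1)]
print(ratios[0], ratios[-1])         # the ratio settles to a constant below one
```

For a machine of this kind the ratio stabilizes after a handful of repetitions; the stable value plays the role of the spectral radius of $A(w)$ in the lemma.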
II. THE CASE OF WIKIPEDIA

We now consider a real-world example of how the probabilistic pumping lemma can be used to 
study complex processes, and to determine when the data may not be well-described by a hidden 
Markov model. 

In particular, we consider whether the coarse-grained dynamics of the editing of a single article 
on the collaboratively-edited online system Wikipedia are well-described by a finite-state model. 
The probabilistic pumping lemma of Sec. I will provide the central source of evidence against this 
claim. 

Wikipedia is an unusually successful example of collaborative, or crowd-sourced, editing; indeed,
the distinction between users and editors is so small that the terms can be used interchangeably.
Despite the vast numbers of edits (a non-trivial number of which are classified as "vandalism"
by other editors) and the dominance of anonymous and pseudonymous contributions, the
accuracy of Wikipedia rivals that of more traditionally curated references such as the Encyclopaedia
Britannica over a wide range of technical topics [15].

Questions about the nature of the underlying processes that allow for this level of accuracy,
along with those about the emergence of novel institutions [16] and behavioral norms [17, 18],
put the study of Wikipedia in a long tradition of questions about the nature of decentralized and
self-organizing systems. In such systems, questions of conflict dynamics and conflict resolution are
particularly compelling, and have received a great deal of recent attention [19]. It is for this reason
that, in considering the edits that are made to an individual page, we consider a coarse-graining
of the system sensitive to the initiation and resolution of conflict.

Coarse-graining is, of course, always necessary: the number of possible edits that editors can 
make is essentially unbounded and any edit may change, add, or delete arbitrary amounts of text 
from the article. A well-known distinction, however, exists between edits that alter the text in a 
novel fashion and those that "roll back" the text to a previous state. The latter kind of edit, called
a "reversion," is used when an editor disagrees with an edit made by someone else and, instead of
altering the text further, simply undoes the work of his or her opponent. Reverts are a natural class 
to consider in a study of online conflict [20-22]; as noted by Ref. [23], who studied reversion as a 
measure of conflict across multiple Wikipedia-like systems, reversions capture implicit cases of task 
conflict, which are strongly associated with the broader phenomenon of relationship conflict [24]. 

We thus coarse-grain the history of edits made on an article into two classes, C ("cooperate") 
and R ("revert"); an example of this process is shown in Table I, while the details of our processing 
of the raw data are given in Appendix VII. We ask the following: is the time series of C and R in 
the class of finite-state processes? 
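The coarse-graining can be sketched in a few lines. This is only an illustration under a stated assumption: an edit is coded R when it restores page content whose hash has already appeared earlier in the history, and C otherwise. The paper's actual processing is described in its Appendix VII, which may differ; the function name `coarse_grain` is hypothetical, and the hashes are the partial SHA1 values of Table I.

```python
def coarse_grain(hashes):
    """Map a chronological list of page-content hashes to a C/R string.

    Assumption: an edit counts as a revert (R) when it restores content whose
    hash has already been seen earlier in the list; otherwise it is coded C.
    """
    seen = set()
    codes = []
    for h in hashes:
        codes.append("R" if h in seen else "C")
        seen.add(h)
    return "".join(codes)

# Partial SHA1 hashes from Table I (George_W._Bush, 21 March 2006).
table_i = [
    "4abc4aef1ea5", "1e3a2a4656d8", "4abc4aef1ea5", "3b03700b0d9c",
    "94a5c05ba10e", "3b03700b0d9c", "109986b8f390", "334a315944ce",
    "739c15e5bc6a", "3063a0289680", "7aafc8f3f762",
]
print(coarse_grain(table_i))  # CCRCCRCCCCC
```

Applied to a single day of data, the first edit is necessarily coded C because no earlier hash is available; a full-history run would not have this boundary effect.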

A negative answer to this question has great generality, in part due to the very weakness of 
the question itself. For example, establishing that this (very) coarse-grained description of the
process does not lead to a finite-state model means that no finer-grained description can be
either. As an example of this, it is clear "by eye" that many edits that fall under the cooperate 
category are actually signs of other forms of conflict — for example, partial reversions, explicitly 
conflicting edits, and so forth — but a further subdividing of the symbol space, compatible with the 
original coarse-graining, can only exacerbate the problem. 

By the same logic, if the dynamics of a single article, considered in isolation, are not finite-state, 
it necessarily implies that those describing an interacting system of articles are also non-finite-state. 

time (UTC)   user           SHA1 (partial)   code
02:08        Sarah          4abc4aef1ea5     C
05:02        Alexh25        1e3a2a4656d8     C
05:04        Mhking         4abc4aef1ea5     R
11:39        Trezatium      3b03700b0d9c     C
12:15        Brazilfantoo   94a5c05ba10e     C
12:31        Brandon39      3b03700b0d9c     R
23:28        Titoxd         109986b8f390     C
23:31        Titoxd         334a315944ce     C
23:38        Titoxd         739c15e5bc6a     C
23:40        Titoxd         3063a0289680     C
23:42        Titoxd         7aafc8f3f762     C

TABLE I. A day of edits on the George_W._Bush page, starting at midnight UTC, 21 March 2006.
As can be seen by comparing SHA1 hashes of the page content, user Mhking reverted an edit
by user Alexh25 to the previous version by user Sarah. Later in the day, user Brandon39
reverted user Brazilfantoo. In between, one can see "cooperative" stretches involving both single
and multiple users. This sequence of events is coarse-grained into the substring "CCRCCRCCCCC."
The full string of (in this case) 44,955 action symbols forms the basis of the finite-state
analysis. As with all data used in this study, this sequence is publicly available, in this case at
http://en.wikipedia.org/w/index.php?title=George_W._Bush&offset=200603218&action=history.



It is well known, for example, that editors will often encounter each other, by accident or design, on 
different articles; including these processes in the model will, again, never decrease the complexity 
of the system. 

Strictly speaking, a further condition obtains: if the collective phenomenon is not finite-state,
then it cannot be due to the interaction of a finite set of systems that are themselves
finite-state. However, the aggregation of large numbers of finite-state systems may be able to accomplish
non-finite-state tasks over a limited range. In this case, the combinatorics of the interactions of 
these finite-state systems are leveraged to produce a state space sufficiently large that it may be 
well-described by a system with truly unbounded resources. 

Fig. 1 shows the distribution of consecutive C edits for the most edited article in the Wikipedia 
"main space" (i.e., that set of pages supposed to constitute the encyclopedic content): that referring 
to George W. Bush, the 43rd President of the United States. We refer the reader to Appendix VII, 
where we show that $N(RC^kR)$ is a preferred estimator of the probability of repeated cooperation
under the assumption of a finite-state process. 
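Counting the strings $RC^kR$ in a coarse-grained series is straightforward. The sketch below is one possible way to do it, using regular-expression lookarounds so that a single R can close one run and open the next; the function name `run_counts` is hypothetical, and runs that touch the start or end of the series are deliberately excluded because they are not bracketed by R on both sides.

```python
import re
from collections import Counter

def run_counts(series):
    """Return {k: N(R C^k R)} for k >= 1 in a string over the alphabet {C, R}.

    Lookbehind/lookahead do not consume the bracketing R symbols, so
    adjacent runs that share an R are both counted.
    """
    runs = re.findall(r"(?<=R)C+(?=R)", series)
    return dict(Counter(len(run) for run in runs))

print(run_counts("CCRCCRCCCCC"))   # {2: 1}
print(run_counts("RCRCCRCR"))      # {1: 2, 2: 1}
```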

Even at a glance it is clear that a single exponential — which would appear as a straight line on a 
log-linear plot) is insufficient to describe the decay of $P(RC^kR)$ as a function of $k$. However, visual
inspection alone is insufficient to determine whether to prefer a sum of exponentials (Eq. 2) or a 
sum of products of exponentials (Eq. 15) (both of which would be compatible with the asymptotics 
of a finite-state process) to an explicitly non-finite-state process. 

We adopt a fully Bayesian approach to this question, and compare the simplest prediction of the 
probabilistic pumping lemma, Eq. 2, to alternative models that violate the asymptotic convergence 

FIG. 1. Solid line: distribution of consecutive C ("cooperative") events in the edit history of the
most-edited article on the English-language Wikipedia, George_W._Bush. Dashed line: maximum-likelihood fit
for the asymptotic (ASY) model of Eq. 3, preferred over the sum of exponential model (nEXP) of Eq. 2.
As derived in Appendix VII, Eq. 22, the count of prefix-suffix free strings $RC^kR$ is a good estimator of
$P(C^k) - P(C^{k+1})$. We note that the log-linear scale can make expected events seem anomalous; the best-fit
model has a 34% chance of seeing at least one string $RC^kR$ with $k$ greater than 80.



                                                    ASY vs. nEXP
sig.          page name           history length    $\Delta\mathcal{L}$
< 10^{-8}     George_W._Bush      44,984             18.6
< 10^{-6}     Islam               17,586             14.8
< 10^{-5}     United_States       30,715             12.2
              Global_warming      19,376             12.1
< 10^{-4}     Wikipedia           31,591             11.3
              Michael_Jackson     26,662             10.4
< 10^{-3}     2006_Lebanon_War    19,510              8.8
              Deaths_in_2009      20,902              7.7
> 10^{4}      Deaths_in_2007      18,215            -11.5
> 10^{7}      Deaths_in_2008      19,072            -17.5

TABLE II. Log-evidence ratios for the ten most-edited pages on Wikipedia; eight pages show strong (p-value
$< 10^{-3}$) evidence for the asymptotic (ASY) model of Eq. 3 over and above that for the sum of exponentials
(nEXP) model for the simplest version of the Probabilistic Pumping Lemma, Eq. 2. The strongest evidence
in favor of finite-state computation is for two of the three "death list" pages, which collate otherwise
unrelated information from other parts of the encyclopedia. Appendix VIII gives details on the use and
computation of $\mathcal{L}$ for model selection.

properties. In particular, we choose the following model,

$$P_{\mathrm{ASY}}(w^k) = A \prod_{i=1}^{k} \left(1 - \frac{p}{i^{\alpha}}\right), \qquad (3)$$

which we refer to as the asymptotic model, or ASY. Despite its simplicity (two parameters and 
an overall normalization) ASY appears to fit the data well over a wide range, reproducing the 
approximately-exponential cutoff at small k, but allowing for an extended tail of long cooperative 
strings. It also explicitly violates the probabilistic pumping lemma, since

$$\exp\left[\limsup_{k\to\infty} \frac{1}{k} \log P_{\mathrm{ASY}}(w^k)\right] = 1. \qquad (4)$$
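The contrast between a finite-state decay and a model that violates the lemma can be checked numerically. The sketch below assumes, purely for illustration, a product form $\prod_{i=1}^{k}(1 - p/i^{\alpha})$ for the violating model (normalization omitted, since it does not affect the limit) and compares it with a single exponential $\beta^k$. The parameter values and function names are hypothetical and are not fitted values from the paper.

```python
import math

def asy_log_prob(k, p, alpha):
    """log of prod_{i=1}^{k} (1 - p / i**alpha); assumed form, normalization omitted."""
    return sum(math.log1p(-p / i ** alpha) for i in range(1, k + 1))

def exp_log_prob(k, beta):
    """log of beta**k: a single-exponential decay, as a finite-state process would give."""
    return k * math.log(beta)

def limiting_rate(log_prob, k):
    """exp[(1/k) log P(w^k)], the quantity whose lim sup appears in the lemma."""
    return math.exp(log_prob / k)

for k in (10, 1000, 100000):
    print(k,
          limiting_rate(asy_log_prob(k, p=0.5, alpha=1.0), k),
          limiting_rate(exp_log_prob(k, beta=0.9), k))
```

Under these assumptions the first rate creeps toward one as $k$ grows, while the second stays fixed at $\beta$, which is the distinction the lemma exploits.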



We note in passing two alternative models, both of which share this property. The first is the
commonly-used power-law,

$$P_{\mathrm{PL}}(w^k) \propto k^{-\alpha}, \qquad (5)$$

which is unable to reproduce the small-$k$ regime and is thus strongly disfavored in model selection
over the full range relative to nEXP. The second is the $q$-Pochhammer function,

$$P_{q}(w^k) = \prod_{i=1}^{k} \left(1 - p q^{i}\right), \qquad (6)$$

which is a better fit to the data than the power-law case, but disfavored relative to ASY.

We compute Bayesian Evidence ratios for the nEXP and ASY model classes; explicitly, we
compute an approximation to

$$\Delta\mathcal{L} = \log \frac{P(D|\mathrm{ASY})}{P(D|\mathrm{nEXP})}, \qquad (7)$$

where $P(D|\mathrm{nEXP})$ and $P(D|\mathrm{ASY})$ are the probabilities of finding the observed data $D$ given
either the nEXP or ASY as the underlying process. Details of this computation are presented in
Appendix VIII. Table II shows that strong evidence against the nEXP model, and thus against
the description of these series as finite-state, can be found in a majority of cases of the top-ten
most-edited articles on the encyclopaedia.
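One way to read the log-evidence ratios of Table II is to exponentiate them. The short sketch below assumes that the tabulated ratio is a natural logarithm and that the "sig." column bounds the corresponding odds $e^{-\Delta\mathcal{L}}$; this reading is an assumption (the precise definition is in Appendix VIII), though it is numerically consistent with every row of the table. The function name is hypothetical.

```python
import math

def odds_from_delta_L(delta_L):
    """exp(-dL): odds of nEXP relative to ASY if dL is a natural-log evidence ratio."""
    return math.exp(-delta_L)

# (page name, log-evidence ratio) pairs as listed in Table II.
TABLE_II = [
    ("George_W._Bush", 18.6), ("Islam", 14.8), ("United_States", 12.2),
    ("Global_warming", 12.1), ("Wikipedia", 11.3), ("Michael_Jackson", 10.4),
    ("2006_Lebanon_War", 8.8), ("Deaths_in_2009", 7.7),
    ("Deaths_in_2007", -11.5), ("Deaths_in_2008", -17.5),
]

for page, d in TABLE_II:
    print(f"{page:18s} dL = {d:6.1f}   exp(-dL) = {odds_from_delta_L(d):.1e}")
```

Negative entries give odds greater than one, that is, evidence in favor of the sum-of-exponentials description.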



III. DISCUSSION 

The results of the previous section strongly suggest that the social process of collaborative 
editing on Wikipedia is not finite-state. 

The cases in Table II for which this is not the case are themselves of interest. These articles are 
of a very different nature ("death lists," collections of single sentences listing the dates of deaths 
of noteworthy individuals). That these cases are better described by the sum-of-exponentials 
model — in some cases, with extremely good evidence — suggests that the article content is relevant 
to the emergence of non-finite-state computation. This could be because the user bases that
particular content types attract make it easier for the resultant system to produce non-finite-state
behavior. Conversely, it could be that the article content itself leads to non-finite-state editing
patterns.

As concrete — but, we emphasise, toy — examples of these two hypotheses, consider first Loop_quantum_gravity 
vs. Alternative_medicine. Both articles have a similar in-depth style of presentation, but will 
attract different communities of users and editors. These differing communities may hold social 
norms that lead to comparatively greater or lesser difficulties in producing the kinds of complex 
social phenomena that lead to non-finite-state time-series. 

Conversely, consider Loop_quantum_gravity vs. List_of_physicists. In this case, the differing
nature of the articles themselves — the latter simply a list of names — will lead to differing editing 

FIG. 2. Solid line: distribution of consecutive single-user C ("cooperative") events in George_W._Bush.
The contrast to the multi-user case is clear, showing that long periods of cooperative editing cannot be
accounted for by unbroken single-user patterns. The distribution is well-modeled by a limiting form of ASY,
Eq. 8, with distinct functional form and parameter values from the ASY fit for the multi-user case. The
limit-ASY fit is preferred to the finite-state nEXP model at $\Delta\mathcal{L} \approx 7.6$ ($p < 10^{-3}$).

patterns. Users may be unlikely to take numerous consecutive edits to adjust the list article, and 
will have reduced probabilities of interaction. Behaviors are different despite the (presumably) 
similar composition and norms of the underlying communities. 

An extreme version of the second hypothesis would be to claim that all of the non-finite-state
computation comes from non-interacting users who independently and separately come into contact 
with an article. The interactions between individuals, on this picture, are unimportant; the content 
of the page itself serves as an effectively unbounded resource that allows violation of the exponential 
cutoffs required by the finite-state case. 

For example, upon interacting with the page cooperatively (C), the user might alter it in such a 
way as to make the probability of a second cooperative edit (by the same user) more likely, and so 
on. Such a process could potentially lead to behaviors of the same nature as those accounted for by 
the ASY model, without having anything to do with any interpersonal or group-level interaction. 

Fig. 2 examines this question in detail for the George_W._Bush case. We now augment the
time-series with an additional symbol, N, representing a change of user (for example, for the data shown
in Table I, the new series would be CNCNRNCNCNRNCCCCC), and count strings of consecutive
Cs bracketed either by R or N; in other words, a change of user is considered to interrupt the run
of Cs. We find the ASY model strongly disfavored compared to the nEXP model, while a limiting
form of ASY, limit-ASY,

$$P_{\mathrm{limit\text{-}ASY}}(w^k) \propto \prod_{i=2}^{k} \left(\,\cdots\,\right), \qquad (8)$$

is preferred at the $10^{-3}$ level over nEXP. We caution that this non-exponential form is not necessarily
evidence for non-finite computation in any particular individual, since the limit-ASY distribution
found for the collection could be understood as the superposition of finite-state machines drawn
from a distribution representing the spread of the properties of individuals.

The distinct functional form of the distribution at the individual level suggests that some aspect
of interpersonal interaction plays a role in the non-finite nature of the full process. Whether this
is driven by how groups are more able to take advantage of the effectively unbounded resource
of the page itself (a "large scratchpad" model), or because some system memory is encoded in 
the interactions between the users themselves (an "interaction combinatorics" model) is an open 
question. 
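The single-user analysis described above, in which a symbol N marks each change of user, can be sketched as follows. The input format (a chronological list of (user, code) pairs) and the function names are assumptions made for illustration; the data are the eleven edits of Table I.

```python
import re
from collections import Counter

def augment_with_user_changes(edits):
    """Insert 'N' between consecutive edits made by different users.

    edits: chronological list of (user, code) pairs with code in {"C", "R"}.
    """
    out = []
    prev_user = None
    for user, code in edits:
        if prev_user is not None and user != prev_user:
            out.append("N")
        out.append(code)
        prev_user = user
    return "".join(out)

def single_user_runs(series):
    """Count runs of C bracketed on both sides by R or N, keyed by run length."""
    runs = re.findall(r"(?<=[RN])C+(?=[RN])", series)
    return dict(Counter(len(run) for run in runs))

table_i = [
    ("Sarah", "C"), ("Alexh25", "C"), ("Mhking", "R"), ("Trezatium", "C"),
    ("Brazilfantoo", "C"), ("Brandon39", "R"),
] + [("Titoxd", "C")] * 5

series = augment_with_user_changes(table_i)
print(series)                    # CNCNRNCNCNRNCCCCC
print(single_user_runs(series))  # {1: 3}
```

The trailing run of five Cs by a single user is not counted here because, within this one-day excerpt, nothing closes it on the right.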

As a further check on the robustness of our results, we do a bootstrap resampling of the
single-user sequences, remove the 'N' symbols, and analyse the subsequent distribution. The resultant
time-series show no evidence for violation of the pumping lemma (on average, slight evidence
against ASY, $\langle\Delta\mathcal{L}\rangle$ equal to $-1.2$). This is in contrast to the strong evidence for ASY in the
original series ($\Delta\mathcal{L}$ of $+17.5$), and is a further demonstration of the role of interaction effects in
driving the violation of the finite-state case.

An obvious visual difference between Figs. 1 and 2 is the elimination of the long tail; it turns
out that long cooperative runs are multi-user events. While it is not the case that long cooperative
events necessarily imply the ASY over the nEXP model (they can be found as well in the "death 
list" pages, where they are fit by a single long timescale exponential component), it is certainly 
true that the exponential decays implied by the probabilistic pumping lemma require increasingly 
unlikely fine-tunings of amplitude and decay constants to fit long periods of cooperative behavior. 

The difficulties caused by this fine-tuning can be seen by examining the properties of explicitly
formulated maximum-likelihood finite-state machines. For the George_W._Bush time series, we
use a standard implementation [25] of the Expectation-Maximization algorithm [26] to find the
particular finite-state machine most likely to have generated the observed (multi-user) system. We
then measure $N(RC^kR)$ over multiple realizations of the output of these best-fit machines.
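The "simulate, then count" step can be sketched without any fitting. In the example below the Expectation-Maximization fit is not performed; instead a hand-written, purely hypothetical two-state Mealy machine stands in for a best-fit machine, and run lengths are accumulated over several independent realizations. All names and transition probabilities are illustrative assumptions.

```python
import random
import re
from collections import Counter

# Hypothetical two-state Mealy machine, not a fitted model.
# MACHINE[state] lists (probability, emitted symbol, next state).
MACHINE = {
    0: [(0.7, "C", 0), (0.2, "C", 1), (0.1, "R", 0)],
    1: [(0.5, "C", 1), (0.1, "C", 0), (0.4, "R", 0)],
}

def simulate(machine, length, seed=0, state=0):
    """Emit `length` symbols from the machine, starting in `state`."""
    rng = random.Random(seed)
    out = []
    for _ in range(length):
        probs, symbols, nexts = zip(*machine[state])
        idx = rng.choices(range(len(probs)), weights=probs)[0]
        out.append(symbols[idx])
        state = nexts[idx]
    return "".join(out)

def run_counts(series):
    """Counter mapping k to the number of strings R C^k R (k >= 1) in the series."""
    return Counter(len(run) for run in re.findall(r"(?<=R)C+(?=R)", series))

# Accumulate run-length counts over several realizations.
counts = Counter()
for seed in range(20):
    counts.update(run_counts(simulate(MACHINE, 5000, seed=seed)))
for k in sorted(counts)[:10]:
    print(k, counts[k])
```

Replacing `MACHINE` with the transition probabilities of a fitted model would give the predicted distribution to compare against the observed counts.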

For machines with eight states (and thus a total of sixty-four underlying parameters), we find
the predicted distribution $N(RC^kR)$ disfavored relative to the ASY model by $\Delta\mathcal{L}$ of 36.9 (or
$p = 10^{-16}$). The majority of the relative log-likelihood penalty comes from longer cooperative
runs. For shorter runs (less than twenty consecutive cooperations), the contribution to $\Delta\mathcal{L}$ is 3.57
($p = 0.03$); for runs between twenty and eighty Cs, a total of 87 events, $\Delta\mathcal{L}$ is 10.6 ($10^{-5}$); and a
single long cooperative event, of length 103, contributes $\Delta\mathcal{L}$ of 22.7 ($10^{-10}$) [27]. It is these more
extreme cases in the long tail that are hardest to account for by reference to the finite-state case.

Recent work on the symbolic dynamics of bird song is relevant to the discussion of long tails and 
finite-state processes. Once regarded as strictly finite-state [28], the sound sequences produced by 
songbirds are now recognized to show features of non-finite-state computation. A recent, compact
model of song production in the Bengalese finch (Lonchura striata domestica) [29] demonstrates
the need for a self-modifying (and thus non-finite-state) Markov process; an analysis of data on a
different species, the Zebra finch (Taeniopygia guttata), shows that the probability of an additional
repetition, the analog of this paper's $P(C^{k+1})/P(C^k)$, decreases exponentially [30]. This is, of
course, the other way to violate the probabilistic pumping lemma: the exponential of the lim sup,
Eq. 1, going to zero as opposed to unity. It is just as much evidence against finite-state
computation, but found in the anomalous absence, rather than presence, of extreme events.

The absence of a violation of the probabilistic pumping lemma is not evidence against
non-finite-state computation. Even in the case of infinite data, it is easy to construct non-finite-state
processes that show exponential decay in all repeated strings; an example can be constructed 
for a stochastic context-free language that generates strings of matched, but arbitrarily nested, 
parentheses: "...()((())())...". We defer detailed discussion of this question, and the extension of 
statistical constraints on stochastic grammars, to later work. 
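A stochastic context-free generator of matched parentheses is easy to write down. The sketch below uses the grammar S -> ( S ) S with probability `p_open` and S -> empty otherwise; the grammar, the parameter value, and the function names are illustrative assumptions rather than the construction the authors have in mind, and the code does not itself demonstrate the exponential decay mentioned in the text.

```python
import random

def sample_parens(p_open=0.4, seed=0, max_len=10_000):
    """Sample a string from S -> ( S ) S (prob. p_open) | empty (otherwise).

    With p_open < 0.5 the expansion terminates with probability one; max_len
    is only a safety cap.
    """
    rng = random.Random(seed)
    out = []
    stack = ["S"]                      # pending symbols; the top of the stack is next
    while stack and len(out) < max_len:
        sym = stack.pop()
        if sym == "S":
            if rng.random() < p_open:
                # Push in reverse so that "(" is produced first.
                stack.extend(["S", ")", "S", "("])
        else:
            out.append(sym)
    # If the cap was hit, expand every remaining S to empty and flush terminals,
    # which keeps the output balanced.
    out.extend(sym for sym in reversed(stack) if sym != "S")
    return "".join(out)

def is_balanced(s):
    """True if every ')' closes an earlier '(' and nothing is left open."""
    depth = 0
    for ch in s:
        depth += 1 if ch == "(" else -1
        if depth < 0:
            return False
    return depth == 0

sample = "".join(sample_parens(seed=s) for s in range(30))
print(sample, is_balanced(sample))
```

Tracking the nesting depth requires an unbounded counter, which is why no finite-state machine generates exactly this language.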
IV. CONCLUSIONS

This work has examined how cooperative behavior in a large-scale social system exceeds the 
finite-state case. Among other things, it strongly implies that game-theoretic models of cooperation 
that rely on finite-state models (such as the base-case tit-for-tat strategy for iterated prisoner's 
dilemma [31]) may yet be incomplete accounts of real-world cooperative behavior. The results
of Sec. III further suggest that distinct mechanisms for the violation of the finite-state case are
associated with, on the one hand, the cognitive properties of individuals taken separately, and on 
the other, the fundamentally social phenomenon of Wikipedia as a whole. 

Much work remains to be done to determine the nature of this violation. The computational 
complexity of the process may be fundamentally connected to reputation or memory effects beyond 
the finite state [32-34]; alternatively, "rational actor" models may be insufficient and full accounts
may require attention to the emergence of social norms [35]. 

Our work here has taken a rigorously functionalist viewpoint. Under this systemic view, what 
is important about a social system is not its particular material instantiation, nor even its history 
of formation or the psychological states of its individuals, but rather the way in which the sum of 
all these facts locate the system in an abstract state space, and the ways in which the historically, 
psychologically and materially-determined interactions are arranged so as to determine this more 
abstract system's future evolution. 

In the modern version, the functionalist viewpoint leads us to give computational accounts of 
thought. This is most clearly seen in modern linguistics, where the biochemical processes that
underlie the comprehension and production of language are reformulated as an abstract mathematical
object: the formal languages of theoretical computer science. Formal language theory has been
extended beyond the human language case to describe human social behavior (see, e.g., Ref. [36]
on "shaking hands"), animal communication [11, 37] and animal behavior [38] and pattern
recognition more generally (Ref. [10] and references therein). This joins the empirical study of cognitive
phenomena to a long tradition in the theory of complexity [9]. 

In general, computational accounts are indifferent to notions of the individual. When the state 
of a group is taken to be the sum of the states of the individuals that compose it, coarse-grainings of 
the system state will in general lead to effective theories [39] whose basic units are not descriptions 
of the state of any one individual. In the cognitive sciences the possibility of going beyond the 
boundaries of an individual agent when producing an account of a large-scale computation is 
sometimes called Wide Computationalism [40, 41] or the Extended Mind Hypothesis [42, 43]. 

We have previously given such accounts in the case of an animal system [44, 45], where a single 
formalism is used to attribute computational ("strategic") states to both individual animals and 
emergent groups. Ref. [46] provides an explicit analogy between the formal language hierarchy and 
the decompositions of Ref. [44]. 

Our work in this paper takes the necessary and logical step of extending this account to human 
social systems, considered not as ensembles of individual (formal) language users but as a
free-standing and unreduced process. Over and above its role in the discussion about cooperative
phenomena in social systems, our main result presents a challenge to theory: what formalisms are 
most natural for the description of non-finite-state processes in the physical world? 

Our discussion in Sec. III demonstrates that empirical study itself can play a role in determining
the relative importance of different ways a system can transcend the finite-state case:
large scratchpads vs. interaction combinatorics. While formal language theory presents us with a
number of "post-finite" languages, such as the context-free grammars and pushdown automata [47], 
it seems likely that these will have to be extended or modified to provide tractable models for 
empirical investigation. 
V. APPENDIX: PROOF OF THE PROBABILISTIC PUMPING LEMMA

Statement of Lemma. For any probabilistic finite-state process, and any initial distribution over
internal states, if there exists a $p$ such that for all $k > p$, $P(w^k) > 0$, there exists a positive real
number $\epsilon$, $0 < \epsilon < 1$, such that $\exp\left[\limsup_{k\to\infty} (1/k) \log P(w^k)\right] = \epsilon$ as $k$ becomes large.

Proof. We will assume the Mealy machine formalism (observed symbols are emitted upon
transitions between internal states [49]). Let $A$ be the transition matrix for the process; an element
$A_{ij}(\sigma)$ gives the conditional probability of a transition to state $j$, emitting symbol $\sigma \in \Sigma$, given
that one was previously in state $i$. If the process is reducible, we will assume that sufficient time
has passed for the process to reach an irreducible subspace of this matrix, and we confine our attention
to that subspace.

We may extend the definition of $A(\sigma)$ to words, as

$$A_{ij}(w) = \sum_{a_1, \ldots, a_{|w|}} A(w_0)_{i a_1} A(w_1)_{a_1 a_2} \cdots A(w_{|w|})_{a_{|w|} j},$$

where $w_i$ is the $i$th symbol in word $w$. We have, further, given assumptions,

$$0 \le A_{ij}(w) \le A^{|w|}_{ij}, \qquad (9)$$

or, in words, the probability to go from state $i$ to state $j$ and emit the word $w$ is less than or equal
to that of simply going from $i$ to $j$ in the same number of steps.

By the Perron-Frobenius theorem, the inequality of Eq. 9 implies that all eigenvalues, $\beta_i$, of
$A_{ij}(w)$ are within the unit circle ($|\beta_i| \le 1$ for all $i$) with equality obtaining only in the case that
$A_{ij}(w)$ is identical to $A^{|w|}_{ij}$. We neglect this latter, trivial case, which only obtains when $w$ is
shift-invariant and all observation runs are given by repeated instances of $w$.

If the system (or our knowledge of it) is distributed over its internal states according to
probability vector $\pi_i$, we can write the probability of observing a repeated string $w$ as a trace,

n 

P(w k )=Y,^(w). (10) 

Eq. 9 implies that all eigenvalues, of Aij(w) are within the unit circle (|/3j| < 1 for all i) with 
equality obtaining only in the case that Aij(w) is identical to A^. In this latter case, of course, 
the system produces the same word w, and only that word, repeatedly and without variation, and 
so P(w k ) is (in the case that w is shift-invariant) trivially one. Excluding this case, which can be 
read easily off the data, we may take the bound to be a strict inequality. 

While we have assumed for simplicity that $A_{ij}$ is irreducible, this will not usually be the 
case for $A_{ij}(w)$. This latter matrix will in general contain both essential and inessential 
"self-communicating" classes [50] along with a set of nuisance indices that connect to no other class 
(i.e., $i$ for which $A_{ij}(w)$ is equal to zero for all $j$) [51]. 

The structure of Aij(w) may be visualized as a directed acyclic graph. Inessential classes may 
have non-zero out-degree, while essential classes, and nuisance indices, are the terminal nodes. 
Self-loops are permitted, and exist for both inessential and essential classes; these will be crucial 
to our argument below. 

Because the initial distribution $\pi$ may have zero entries, we consider only the part of $A_{ij}(w)$ 
corresponding to descendants of the non-zero part of $\pi$ in the associated directed acyclic graph. 
Transitions among the set of nuisance indices, by definition, cannot repeat an index. Thus their 
structure is not relevant to the asymptotic behavior of $P(w^k)$, and we may focus on the essential 
and inessential classes. 

We are particularly interested in the classes that will dominate the $P(w^k)$ probability as $k$ 
becomes large. Consider the restriction of $A_{ij}(w)$ to a particular class $a$: i.e., construct a submatrix 
from $A_{ij}(w)$ using only $i, j \in a$. Call this restriction $\alpha_{ij}(w)$. Consider, similarly, the restriction of 
the distribution $\pi$ to this class. Then, the probability of producing $k$ copies of $w$, while remaining 
in the class $a$, is 

$$ P(w^k | a) = \sum_{i,q=1}^{M} \pi_i^{(q)} \beta_q^k, \tag{11} $$

where $\beta_q$ is the $q$th eigenvalue of $\alpha(w)$, and 

$$ \pi_i = \sum_{q=1}^{M} \pi_i^{(q)}. \tag{12} $$

By construction of the equivalence classes, $\alpha$ is irreducible. Then, the largest eigenvalue of this 
matrix, $\beta_1$, is real, has a strictly positive eigenvector, and $\pi^{(1)}$ is necessarily greater than zero. 
If $\alpha_{ij}(w)$ is acyclic (aperiodic), then the leading term in $P(w^k|a)$ can be written 

$$ P(w^k | a) = A_1 \beta_1^k \left( 1 + \sum_{q > 1} A_q \left( \frac{\beta_q}{\beta_1} \right)^k \right), \tag{13} $$

where $A_1 > 0$ and $|\beta_q| < \beta_1$. 

If the period, $d$, of $\alpha_{ij}(w)$ is greater than one, we will have additional eigenvectors associated 
with complex rotations of the leading eigenvalue $r = \beta_1$, namely $r \exp(2\pi i k/d)$, $k = \{1, \ldots, d-1\}$. 
These will lead to additional oscillatory terms in both the leading order term and its corrections. 
The oscillations of the leading term will be governed by an overall exponentially-decaying 
envelope, so that 

$$ \exp \limsup_{k \to \infty} \left( \frac{1}{k} \log P(w^k | a) \right) = \beta_1, \tag{14} $$

regardless of the period of $\alpha_{ij}(w)$. 

Having understood the single-class case, we now consider w k strings generated by multiple 
classes. 

Any particular string $w^k$ may be generated by a set of transitions within and between classes. 
Because these transitions are governed by the directed acyclic graph structure, there will be a finite 
number of transitions between classes. Thus, as $k$ becomes large, the contribution to $P(w^k)$ from a 
particular set of transitions will be governed by the self-transitions, given by terms of the form 
Eq. 14. 

In particular, $P(w^k)$ is the sum of a finite number of terms; each term in the sum is a product 
of at most $p$ transitions between classes, and at least $k - p$ terms of the form $P(w^n|a)$, for different 
$a$. Explicitly, 

$$ P(w^k) = \sum_{i \in p(G)} T_i \prod_{j=1}^{N} P(w^{n_{ij}} | a_j), \tag{15} $$

where $i$ indexes the paths of length $k$ through the graph $G$ representing the underlying $A_{ij}(w)$ 
structure, $T_i$ is a prefactor governing the probabilities of transitions between classes, $N$ is the 
number of classes, and the total number of within-class transitions is forced to grow with $k$, 

$$ \sum_{j=1}^{N} n_{ij} \ge k - p, \tag{16} $$

for all possible paths $i$, with $n_{ij}$ the number of repeats of $w$ generated within class $a_j$ along path $i$. 

For large $k$, the growth in the number of possible paths (i.e., the growth of $|p(G)|$) is 
bounded by the growth in the number of ways to partition the sum in Eq. 16. In particular, for 
large $k$, the number of possible paths relevant to $P(w^k)$ can increase only polynomially in $k$ [52]. 

For large $k$, each term in the sum of Eq. 15 is decreasing exponentially, governed by products 
of the $\beta_{j,1}$, the largest eigenvalues for the classes that have self-transitions for that term. The 
dominant terms in the sum will be those for which the exponential decline is slowest. By the 
Perron-Frobenius theorem, the largest eigenvalue of a submatrix associated with a class of $A_{ij}(w)$ 
is equal to the spectral radius of the matrix as a whole. If $P(w^k)$ is greater than zero for $k$ larger 
than $p$, the pigeonhole principle invoked in the ordinary pumping lemma [53] allows us to assume 
the existence of at least one self-communicating class; this then means that the spectral radius is 
equal to that of $A_{ij}(w)$ itself. Therefore, 

$$ 0 < \exp \limsup_{k \to \infty} \left( \frac{1}{k} \log P(w^k) \right) = \rho(A_{ij}(w)) < 1, \tag{17} $$

which was to be proved. 



□ 
FIG. 3. Numerical study of convergence of repeated word frequencies to exponential decay with cutoff 
predicted by the spectral radius. Shown here is the measured decay rate to the asymptotic limit 
predicted by Eq. 17, for irreducible finite-state processes with ten states, two output symbols $\{C, R\}$, $w$ 
equal to $C$, and a uniform distribution over values of $\rho(A_{ij}(w))$, the spectral radius and asymptotic decay 
rate, between $0 < \rho < 1$. Light blue shows $2\sigma$, and dark blue $1\sigma$, ranges about the median value. For 
empirical work, convergence is much faster when considering $[P(w^{q+k}) / P(w^q)]^{1/k}$, with $q$ larger than the 
(assumed) number of states. 



VI. APPENDIX: NUMERICAL TESTS OF CONVERGENCE PROPERTIES 

With a view towards determining how the lemma of the previous section applies to actual finite- 
state processes, we study a restricted class of machines numerically. We sample from the space of 
probabilistic unifilar machines with p states over a two-symbol alphabet. Such a system can be 
represented by a weighted, directed graph, with each node having at least one, and at most two 
outgoing edges, each of which is associated with one of the two symbols, and whose weights sum 
to unity. 

For small p, this space can be described completely: for each node, we have a choice of one vs. 
two outgoing edges; in the case of only one outgoing edge, we must choose between the two symbols. 
Neglecting the possibility of equivalent machines, we then have the number of such machines, as a 
function of p, as 

$$ N(p) = (2p + p^2)^p, \tag{18} $$

which grows rapidly: there are 12 billion such machines with six states, and more than $10^{400}$ with 
one hundred states. 
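
The two figures quoted for Eq. 18 can be checked directly. The following is a minimal sketch; the function name `count_machines` is ours, and, like Eq. 18, it counts graph topologies only, ignoring edge weights and the possibility of equivalent machines. 

```python
def count_machines(p: int) -> int:
    """Number of p-state, two-symbol unifilar machine topologies, as in Eq. 18."""
    one_edge = 2 * p    # one outgoing edge: choose its symbol (2 ways) and its target (p ways)
    two_edges = p * p   # two outgoing edges: choose one target state per symbol
    return (one_edge + two_edges) ** p   # independent choice at each of the p nodes


print(count_machines(6))                 # six states: about 12 billion
print(len(str(count_machines(100))))     # number of decimal digits for one hundred states
```

Python integers are arbitrary precision, so the one-hundred-state count is exact rather than a floating-point estimate. 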

We are most interested in how quickly the statistics of an actual machine approach the limiting 
value given by Eq. 17. For any particular $A_{ij}(w)$, we can compute the spectral radius and compare 
that, as a function of $k$, to the ratio $P(w^k) / P(w^{k-1})$ found for distributions over initial conditions 
that include a self-communicating class. 
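
As an illustration of this comparison, the sketch below computes $P(w^k)$ by repeated multiplication and compares successive ratios with the spectral radius. The three-state matrix `A_C` and the initial distribution `pi` are hypothetical values chosen for the example, not machines sampled in this work. 

```python
import numpy as np


def repeat_probs(Aw, pi, kmax):
    """Return P(w^k) = sum_ij pi_i [A(w)^k]_ij for k = 1..kmax."""
    v = np.asarray(pi, dtype=float)
    probs = []
    for _ in range(kmax):
        v = v @ Aw              # propagate the state distribution through one more copy of w
        probs.append(v.sum())
    return np.array(probs)


# Hypothetical A(w) for w = C on three states. Row i sums to the probability of
# emitting C from state i, so the rows need not sum to one.
A_C = np.array([[0.6, 0.3, 0.0],
                [0.0, 0.2, 0.3],
                [0.1, 0.0, 0.6]])
pi = np.array([0.0, 1.0, 0.0])          # assumed initial distribution

rho = max(abs(np.linalg.eigvals(A_C)))  # spectral radius of A(w)
P = repeat_probs(A_C, pi, kmax=60)
ratios = P[1:] / P[:-1]                 # P(w^k) / P(w^(k-1)) for k = 2..60

for k in (2, 5, 10, 30, 60):
    print(k, ratios[k - 2], rho)
```

For this example the ratio starts well away from the spectral radius and settles onto it as $k$ grows, at a rate set by the ratio of the subleading to the leading eigenvalue. 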

In Fig. 3 we show convergence to the limit by sampling the space of strongly-connected ten-state 
machines, and considering the frequency of a single repeated symbol. We take a uniform prior over 
$\rho(A_{ij}(w))$, the spectral radius and limit established by the lemma of the previous section, and show 
the convergence ratio, i.e., 

$$ C(k) = \left[ P(w^k) \right]^{1/k}, \tag{19} $$

to provide a numerical example of the limiting process established in the previous section. For small 
$k$, $P(w^k)$ may be dominated by movement through nuisance states and inessential classes, and by 
contributions from essential classes that have small self-communication probability. Convergence 
to the spectral radius thus occurs much faster when considering 

$$ C(q, k) = \left[ \frac{P(w^{q+k})}{P(w^q)} \right]^{1/k}, \tag{20} $$

where $q$ is longer than the relevant scales of the transient phenomena (e.g., at least as large as the 
assumed number of states). This is shown explicitly in Fig. 4, where we take $q$ to be the number 
of states in the system. 

FIG. 4. Convergence to exponential cutoff as seen with $C(q,k)$ (Eq. 20), for the same system as in Fig. 3. 
Here we take $q$ equal to ten, the number of states. For the same amount of data, convergence is faster for 
$C(q,k)$ than for $C(k)$; here convergence for $C(q,k)$ to the asymptotic value (at $1\sigma$ confidence) is achieved 
for $k$ equal to thirty. 
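
The difference between the two estimators can be seen on a small example. The sketch below assumes the estimator forms $[P(w^k)]^{1/k}$ and $[P(w^{q+k})/P(w^q)]^{1/k}$; the matrix `A_C`, the distribution `pi`, and the helper names are hypothetical and chosen only for illustration. 

```python
import numpy as np


def log_repeat_probs(Aw, pi, kmax):
    """Return log P(w^k) for k = 0..kmax (array index equals k, with P(w^0) = 1)."""
    v = np.asarray(pi, dtype=float)
    logs = [np.log(v.sum())]
    for _ in range(kmax):
        v = v @ Aw
        logs.append(np.log(v.sum()))
    return np.array(logs)


def c_plain(logP, k):
    """[P(w^k)]^(1/k): includes the transient prefactor."""
    return np.exp(logP[k] / k)


def c_offset(logP, q, k):
    """[P(w^(q+k)) / P(w^q)]^(1/k): the first q repeats absorb the transient."""
    return np.exp((logP[q + k] - logP[q]) / k)


# Hypothetical three-state A(w) and initial distribution.
A_C = np.array([[0.6, 0.3, 0.0],
                [0.0, 0.2, 0.3],
                [0.1, 0.0, 0.6]])
pi = np.array([0.0, 1.0, 0.0])

rho = max(abs(np.linalg.eigvals(A_C)))
logP = log_repeat_probs(A_C, pi, kmax=33)
q, k = 3, 30                                   # q set to the number of states
print("spectral radius:", rho)
print("plain estimator error: ", abs(c_plain(logP, k) - rho))
print("offset estimator error:", abs(c_offset(logP, q, k) - rho))
```

The plain estimator carries a prefactor that decays only as $1/k$ in the exponent, while the offset estimator cancels it, which is why the latter is closer to the spectral radius at the same $k$ in this example. 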
VII. APPENDIX: WIKIPEDIA ANALYSIS 

Our coarse-graining of behaviors on any particular page aims at locating where one user reverts 
(undoes) the contributions of another editor completely. We locate reversion edits in two distinct 
ways. Firstly, following Ref. [23], we can identify reversion edits by the presence of keywords, such 
as rv and revert, in the edit summaries; we do so with the following regular expression: 
/([Rr]v|[Uu]ndid|[Rr]evert)/. Secondly, following analyses such as those of Ref. [19], we can look for 
versions of a page with identical SHA1 checksums; the version with the later timestamp may thus 
be considered a revert to the earlier page. In general, these two metrics align very well, although 
not perfectly; in this work, we focus on the latter method as a more objective one that does not 
rely on editors self-reporting. We do not include self-reverts, or edits that do not alter any aspect 
of the page (i.e., that would otherwise look like "reverts to the current version"). 
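
The two labelling schemes can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in this work: the input format (a chronological list of `(user, page_text)` pairs), the function names, and the rule that a self-revert is one where every undone edit belongs to the reverting user are all ours, and excluded edits are simply labelled C here. 

```python
import hashlib
import re

# Keyword pattern quoted in the text.
REVERT_RE = re.compile(r"([Rr]v|[Uu]ndid|[Rr]evert)")


def labels_from_summaries(summaries):
    """Label an edit 'R' if its summary matches the keyword pattern, else 'C'."""
    return ["R" if REVERT_RE.search(s or "") else "C" for s in summaries]


def labels_from_checksums(revisions):
    """Label an edit 'R' if it restores an earlier version with the same SHA1 checksum."""
    last_seen = {}               # checksum -> index of the latest revision with that content
    labels, prev_sha = [], None
    for i, (user, text) in enumerate(revisions):
        sha = hashlib.sha1(text.encode("utf-8")).hexdigest()
        label = "C"
        # sha == prev_sha means the page did not change: not counted as a revert.
        if sha in last_seen and sha != prev_sha:
            undone_by = {u for u, _ in revisions[last_seen[sha] + 1 : i]}
            if undone_by != {user}:      # assumption: otherwise it is a self-revert
                label = "R"
        labels.append(label)
        last_seen[sha] = i
        prev_sha = sha
    return labels


history = [("a", "x"), ("b", "y"), ("a", "x"), ("a", "z"), ("a", "x"), ("c", "x")]
print("".join(labels_from_checksums(history)))
print("".join(labels_from_summaries(["rv vandalism", "added refs", "Undid revision by X"])))
```

Note that the keyword pattern is unanchored, so it also matches these letter sequences inside longer words; this is one reason the checksum method is the less subjective of the two. 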

A feature of the naive classification of non-revert edits is the presence of so-called "vandalism": 
improper and non-constructive modifications or blanking of the page, which (since they usually 
do not take the form of reversion) would be classed as cooperative. As noted in the main text of 
Sec. II, this does not undermine the use of the resultant series to test for non-finite computation. 
More detailed descriptions ("prosocial non-revert" vs. "antisocial non-revert", and similarly for 
the revert case, where pro-social reverts repair vandalism) are certainly possible, and, from the 
point of view of a detailed understanding, extremely desirable. However, this fine-graining of the 
time-series can only increase the complexity of the process. 

The probabilistic pumping lemma works in terms of $P(w^k)$, and our analysis considers the 
probability of repeated cooperation. However, the measurement of $P(C^k)$ in the data, if done 
naively, leads to unacceptable results. In particular, estimating $P(C^k)$ for a particular page by 
counting the number of times the string $C^k$ appears in the time-series leads to strong bin-to-bin 
correlations, since an observation of a string $C^k$ necessarily leads to observations of strings of the 
form $C^{k-1}, C^{k-2}, \ldots, C^{k-\lfloor k/2 \rfloor+1}$, and then two observations of the form $C^{k-\lfloor k/2 \rfloor}$, and so on. This 
would lead to excessive complications in the likelihood analysis; conversely, if the correlations are 
neglected, it leads to claims of heavy-tailed distributions that spuriously rule out exponential decay. 

Instead, we count prefix-suffix-free strings that do not have this shift problem; in particular, 
we consider the quantity $N(RC^kR)$. As long as $N(RC^kR)$ is significantly less than $N$, counts of 
$RC^kR$ and $RC^mR$ are independent of each other and we can write 

$$ P(RC^kR) \approx \frac{N(RC^kR)}{N}. $$
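
Counting the bounded runs is straightforward once the page history is coded as a string over {C, R}. The sketch below is a minimal illustration; the function name and the string encoding are our assumptions. 

```python
import re
from collections import Counter


def count_bounded_runs(series: str) -> Counter:
    """Return {k: N(RC^kR)} for k >= 1 in a series coded as a string of 'C' and 'R'."""
    counts = Counter()
    # Lookbehind and lookahead leave the bounding R unconsumed, so one R can
    # close a run and open the next. Runs touching either end of the series
    # are not bounded on both sides and are skipped.
    for match in re.finditer(r"(?<=R)C+(?=R)", series):
        counts[len(match.group())] += 1
    return counts


series = "CCRCRCCCRRCCRC"
counts = count_bounded_runs(series)
print(sorted(counts.items()))
print({k: n / len(series) for k, n in counts.items()})   # N(RC^kR) / N
```

Each maximal run of C contributes to exactly one bin, which is what removes the bin-to-bin correlations described above. 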

The quantity $P(RC^kR)$ itself can be written as 

$$ P(RC^kR) = P(R) P(C^k|R) P(R|RC^k) = P(R) P(C^k|R) \left[ 1 - P(C|RC^k) \right] 
   = P(R) \left[ P(C^k|R) - P(C^{k+1}|R) \right]. \tag{21} $$

In the case that $P(C^k|R)$ is the sum of exponentials in $k$, we have 

$$ N(RC^kR) \propto P(C^k|R) \propto P(C^k), \tag{22} $$

or, in words, that if $P(C^k)$ is a sum of exponentials, so is $N(RC^kR)$. The relationship between 
these two quantities is not always so simple; in the ASY case, as well as the power-law and 
q-Pochhammer case, Eq. 21 implies that the quantity $N(RC^kR)$ has a different functional form from 
$P(C^k)$ (though never a simple exponential cutoff). In particular, we have 

$$ \Delta P_{\mathrm{ASY}} = \frac{A p}{(k+1)^{\alpha}} \prod_{i=1}^{k} \left( 1 - \frac{p}{i^{\alpha}} \right), \tag{23} $$

$$ \Delta P_{\mathrm{PL}} = A \left( \frac{1}{k^{\alpha}} - \frac{1}{(k+1)^{\alpha}} \right), \tag{24} $$

and 

$$ \Delta P_{qP} = p q^{k+1} \prod_{i=1}^{k} \left( 1 - p q^{i} \right). \tag{25} $$

In words, Eq. 23 shows that measurement of $N(RC^kR)$ leads to power-law behavior in the long 
tail. Conversely, note that the q-Pochhammer function leads to an exponential cutoff. 
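
The effect of the differencing in Eq. 21 on each model can be checked numerically. The sketch below is illustrative only: the parameter values are hypothetical, and the ASY form used, $P(C^k) = A \prod_{i \le k} (1 - p/i^{\alpha})$, is our assumption (it reproduces the relation $P(C^1) = A(1-p)$ quoted in Sec. VIII). 

```python
import numpy as np


def delta(P):
    """P(C^k) - P(C^(k+1)), the bracket in Eq. 21, for k = 1..len(P)-1."""
    P = np.asarray(P, dtype=float)
    return P[:-1] - P[1:]


k = np.arange(1, 402)

# Sum of two exponentials with hypothetical amplitudes and rates.
amps, betas = np.array([0.6, 0.3]), np.array([0.5, 0.9])
P_exp = (amps * betas ** k[:, None]).sum(axis=1)
d_exp = delta(P_exp)
# Differencing keeps each rate and rescales each amplitude by (1 - beta).
closed_form = (amps * (1 - betas) * betas ** k[:-1, None]).sum(axis=1)
print("sum of exponentials preserved:", np.allclose(d_exp, closed_form))

# Assumed ASY form with hypothetical parameters.
A, p, alpha = 1.0, 0.3, 1.5
P_asy = A * np.cumprod(1 - p / k ** alpha)
d_asy = delta(P_asy)
# Local log-log slope of the differenced series between k = 300 and k = 400.
slope = np.log(d_asy[-1] / d_asy[-101]) / np.log(k[-2] / k[-102])
print("ASY tail slope:", slope, "compared with -alpha =", -alpha)
```

Under these assumptions the differenced exponential model is again a sum of exponentials with the same rates, while the differenced ASY model falls off as a power law with index close to $\alpha$. 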



VIII. APPENDIX: DETAILS ON MODEL SELECTION 



In this section we describe in greater detail our methods for distinguishing between the asymp- 
totic and exponential models. 

Computation of the likelihood ratio requires an error model for the distributions of $N(RC^kR)$, 
such as those shown in Fig. 1. Since we lack an explicit model for the errors themselves, as a 
first approximation, we take measurements of $N(RC^kR)$ to be identically and independently 
distributed. For $N(RC^kR) \ll N$, $N$ the total number of observations, this is a reasonable assumption. 
Given this assumption, the Poisson distribution of counts follows, and computation of the 
log-likelihood, or $\log P(D|w, M)$, for any particular model $M$ with parameters $w$, can be written 
as 

$$ \mathcal{L} = \sum_{k=1}^{k_{\max}} \left[ N(RC^kR) \log \lambda(w, k) - \lambda(w, k) \right], \tag{26} $$

where $\lambda(w, k)$ is the count the model predicts for $N(RC^kR)$ and we drop model-independent 
constants. Given sufficiently flat priors, $P(w|M)$, around the peak of this function, this is sufficient 
to estimate many quantities of interest, including the maximum a posteriori values of $w$ and the 
error bars on those estimates. 
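
A minimal sketch of the Poisson log-likelihood of Eq. 26 follows. The function names and the parameterization of the nEXP prediction as `amps[i] * betas[i]**k` are hypothetical, introduced only for this example. 

```python
import numpy as np


def poisson_log_likelihood(counts, expected):
    """Eq. 26: sum over k of N_k log(lambda_k) - lambda_k, dropping the log(N_k!) constant."""
    counts = np.asarray(counts, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return float(np.sum(counts * np.log(expected) - expected))


def expected_counts_nexp(amps, betas, kmax):
    """Hypothetical nEXP prediction: lambda_k = sum_i amps[i] * betas[i]**k, for k = 1..kmax."""
    k = np.arange(1, kmax + 1)[:, None]
    return (np.asarray(amps, dtype=float) * np.asarray(betas, dtype=float) ** k).sum(axis=1)


# Toy "data" generated without noise from a 1EXP model; the true rate should score best.
counts = expected_counts_nexp([100.0], [0.8], kmax=20)
for beta in (0.7, 0.8, 0.9):
    lam = expected_counts_nexp([100.0], [beta], kmax=20)
    print(beta, poisson_log_likelihood(counts, lam))
```

Because each term $N \log \lambda - \lambda$ is maximized at $\lambda = N$, the noise-free toy data give the highest value at the generating rate; with real counts the maximum would be located by numerical optimization. 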

Our main goal, however, is not parameter estimation, but rather model selection, where one 
compares models with different sets of parameters. In our particular case, one class of models 
(nEXP) can approximate, by superposition of exponentials, the other class (ASY). As the number 
of exponentials in the sum increases, the approximation becomes increasingly good. We would like 
to know when we are justified in preferring the more parsimonious model. 

Two main frameworks for the resolution of this question exist. On the one hand, the Akaike 
Information Criterion (AIC) can be used to estimate the expected KL divergence between the 
predictions of a model and the "true process." In the limit of large amounts of data, it prescribes 
a constant penalty of $n$, the number of parameters, to the log-likelihood. 

This penalty is sometimes taken as an "Occam penalty," but the correct interpretation is as a 
guide for prediction out of sample. Prediction out of sample is a conceptually distinct problem, 
since a complicated approximation to the true model may work very well in a limited range, 
particularly in the presence of experimental noise; in Monte Carlo testing, AIC is insufficiently 
rigorous, and tends to prefer complicated approximations [ ]. A well-known formal result is that 
AIC is "dimensionally inconsistent," meaning that even in the limit of infinite data, use of the AIC 
will lead to non-zero probability of choosing an (incorrect) approximation [55]. 

On the other hand, one can compute (or approximate) what is called the Evidence [56], which 
requires knowledge of both the likelihood, $P(D|w, M)$, and the prior expectation of parameter 
ranges, $P(w|M)$, 

$$ E = P(D|M) = \int P(D|w, M) P(w|M) \, d^k w, \tag{27} $$

where $k$ is the number of parameters (dimensionality of $w$). Formally, the Evidence is proportional 
to "the probability of the model M, given the data observed," if equal prior probability is given to 
the models under consideration. As in all model selection cases, absolute values of the Evidence 
are irrelevant. One considers only ratios and phrases the question, as in Table II, as to whether 
(for example) "model A is at least a factor of $10^3$ more likely than model B." 

In this work, we take the latter approach, operating entirely within the Bayesian framework. 
This is because our contrasting model classes have small numbers (less than ten) of parameters, all 
of which have clearly specifiable priors, P(w\M). Computation of the full posterior is now common 
when these circumstances obtain, as is often the case in the exact sciences [57-59]. 

In order to calculate $E$, we use the Laplace (or saddle point) approximation; in log-units, 

$$ \mathcal{E} = \log E \approx \mathcal{L}(w_{\max}) + \log P(w_{\max}|M) 
   - \frac{1}{2} \log \det A + \frac{1}{2} k \log 2\pi, $$

where $\mathcal{L}$ is the log-likelihood, $w_{\max}$ are the parameters that maximize the likelihood, and $A$ is the 
Hessian, equal to 

$$ A_{ij} = - \frac{\partial^2 \mathcal{L}}{\partial w_i \partial w_j}, $$

evaluated at $w_{\max}$. 

We refer the reader to Ref. [60] for details on this approximation. 
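
The approximation can be sketched in a few lines. This is an illustration under our own assumptions, not the code used for the results reported here: the Hessian is taken by central finite differences, the prior is assumed flat so that its log at the maximum is a constant supplied by the caller, and all function names are hypothetical. 

```python
import numpy as np


def numerical_hessian(f, x, h=1e-4):
    """Central finite-difference Hessian of a scalar function f at the point x."""
    x = np.asarray(x, dtype=float)
    n = x.size
    H = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.zeros(n), np.zeros(n)
            ei[i], ej[j] = h, h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h * h)
    return H


def laplace_log_evidence(log_like, w_max, log_prior_at_max):
    """L(w_max) + log P(w_max|M) - (1/2) log det A + (k/2) log 2 pi."""
    w_max = np.asarray(w_max, dtype=float)
    A = -numerical_hessian(log_like, w_max)     # minus the Hessian of the log-likelihood
    sign, logdet = np.linalg.slogdet(A)
    if sign <= 0:
        raise ValueError("w_max is not a maximum: det A is not positive")
    k = w_max.size
    return log_like(w_max) + log_prior_at_max - 0.5 * logdet + 0.5 * k * np.log(2 * np.pi)


# Check on a Gaussian log-likelihood, for which the approximation is exact.
mu, sigma = np.array([1.0, -2.0]), np.array([0.5, 2.0])
gauss = lambda w: float(-0.5 * np.sum(((np.asarray(w) - mu) / sigma) ** 2))
prior_area = 10.0                                # flat prior over a region of this area
print(laplace_log_evidence(gauss, mu, -np.log(prior_area)))
print(np.log(2 * np.pi) + np.log(sigma).sum() - np.log(prior_area))   # analytic value
```

The prior area enters only through the constant `log_prior_at_max`, which is why it can be pre-computed once per model class. 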

It remains to specify the priors $P(w_{\max}|M)$ for the two models. The nEXP class has $2n$ 
parameters; the ASY class has 3. The parameters are of two kinds. 

Both nEXP and ASY have a parameter corresponding to the one-step decay of the underlying 
quantity $P(C^k)$. In the case of nEXP, there are $n$ such parameters, $\beta_i$, that play this role; these 
correspond to the eigenvalues of the transition matrix $A_{ij}(w)$. In the case of ASY, there is only 
one, $p$. We take a uniform prior in $p$ (ASY) and $\beta_i$ (nEXP). We allow each of these to range 
independently between zero and 0.995; the high end corresponds to an exponential cutoff of order 
200 repeats, much longer than seen in the data. 

We then have normalizations of terms ($n$ normalizations for nEXP, one for ASY). These are 
fixed by the value of $P(C^1)$, the overall cooperative fraction, 

$$ N(C) \approx N P(C). \tag{29} $$



The maximum value of P(C) is unity. This then leads to an overall area factor of 



21 



for nEXP, where the factor of $n!$ is because the overall sum of all normalizations is confined to the 
interior of an $n$-dimensional simplex. In the case of ASY, $P(C^1)$ is equal to $A(1 - p)$. We thus 
have to integrate over the range of $p$ values to find the area associated with the ASY normalization 
prior, 



Finally, ASY has a third parameter, $\alpha$. For each value of $1 - p$, we allow this to range between 
zero (pure exponential) and $\alpha(p)$, where $\alpha(p)$ is set to give a $1/e$ cutoff at 200 repeats. As an 
example, $\alpha(0.995)$ is zero; if $\alpha$ were greater than zero, the overall function would have an exponential 
cutoff longer than 200 repeats. Given these, the area factor for nEXP is $0.995^n$, and for ASY it is 

$$ \int_0^{0.995} \alpha(p) \, dp \approx 1.28841. $$

Putting together all these area factors, we can then pre-compute $-\log \mathcal{A}$, equal to $\log P(w_{\max}|M)$, 
a constant independent of $w$. For the George_W._Bush article, for example, we have $-\log \mathcal{A}$ equal 
to -12.6 for the ASY case, and -10.3 (1EXP), -18.7 (2EXP), -27.4 (3EXP). Note that prior 
areas are not directly comparable between different models; "change of units" (e.g., working in 
terms of $P(RC^kR)$ vs. $N(RC^kR)$) will scale $\mathcal{A}$. This scaling, however, is directly compensated for 
by the Hessian determinant term. 

Together with the max log-likelihood, the determinant of the Hessian, and the $+\frac{1}{2} k \log 2\pi$ term, these 
are sufficient to compute the (Gaussian approximation to the) relative log-Evidence for the two 
model classes, $\Delta\mathcal{E}$, reported in Table II. In general, the highest evidence member of the nEXP class 
is either 3EXP or 4EXP. Table III gives the results for the top thirty most-edited pages. 
sig.          page name                history length    $\Delta\mathcal{E}$ (ASY vs. nEXP) 

$< 10^{-8}$   George_W._Bush           44,984             18.6 
$< 10^{-6}$   World_War_I              14,515             15.9 
              Islam                    17,586             14.8 
$< 10^{-5}$   Iraq_War                 14,785             12.8 
              Scientology              14,468             12.2 
              United_States            30,715             12.2 
              Global_warming           19,376             12.1 
$< 10^{-4}$   Australia                13,574             11.4 
              Wikipedia                31,591             11.3 
              September_11_attacks     17,078             11.3 
              Israel                   16,036             11.1 
              Super_Smash_Bros._Br     15,300             11.1 
              Turkey                   13,703             11.0 
              Gaza_War                 14,654             10.7 
              List_of_Omnitrix_ali     16,263             10.6 
              Michael_Jackson          26,662             10.4 
              Canada                   17,441              9.4 
              Blink-182                13,839              9.3 
$< 10^{-3}$   2006_Lebanon_War         19,510              8.8 
              Blackout_(Britney_Sp     15,637              7.9 
              Deaths_in_2009           20,902              7.7 
$< 10^{-2}$   Heroes_(TV_series)       13,980              6.6 
              Xbox_360                 16,465              6.4 
              Lost_(TV_series)         14,500              5.1 
              Paul_McCartney           16,453              4.7 
(no det.)     Eminem                   17,071              4.3 
              Pink_Floyd               15,606              2.9 
              Deaths_in_2006           14,072              0.8 
$> 10^{4}$    Deaths_in_2007           18,215            -11.5 
$> 10^{7}$    Deaths_in_2008           19,072            -17.5 


TABLE III. log-Evidence ratios for the thirty most-edited pages on Wikipedia. 
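
The significance column can be related to the log-Evidence difference by simple arithmetic. The sketch below assumes, as our reading of the table rather than a stated definition, that the column reports the odds $\exp(-\Delta\mathcal{E})$ of nEXP against ASY rounded to a power of ten, with odds within a factor of 100 of unity reported as no determination; the upper threshold of 100 is an assumption, since no row in the table falls near it. 

```python
import math


def odds_bin(delta_log_evidence: float) -> str:
    """Power-of-ten bin for the odds exp(-delta) of nEXP against ASY (assumed convention)."""
    odds = math.exp(-delta_log_evidence)
    if 1e-2 <= odds <= 1e2:
        return "(no det.)"
    if odds < 1e-2:
        return f"< 10^{math.ceil(math.log10(odds))}"
    return f"> 10^{math.floor(math.log10(odds))}"


for page, delta in [("George_W._Bush", 18.6), ("Blink-182", 9.3),
                    ("Eminem", 4.3), ("Deaths_in_2008", -17.5)]:
    print(page, delta, odds_bin(delta))
```

Under this reading the bins computed from the tabulated values agree with the sig. column for the rows shown. 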



